Archiver for EIA RECS #534

nilaykumar · 2025-01-22T19:34:00Z

Overview

Closes #519.

What problem does this address?
Implements an archiver for EIA RECS.

What did you change in this PR?
Added archival of some of the 2020 .xlsx files.

Testing

How did you make sure this worked? How can a reviewer verify this?

pudl_archiver --datasets eiarecs \
    --initialize \
    --summary-file eiarecs-summary.json \
    --depositor fsspec \
    --deposition-path /tmp/eia-recs/

To-do list

add other years (before 2020)
add PDF
add microdata
Update relevant documentation - like comments, docstrings, README, release notes, etc.
Review the PR yourself and call out any questions or issues you have
nilaykumar · 2025-01-22T23:49:47Z

src/pudl_archiver/archivers/eia/eiarecs.py

+    # housing characteristics
+    {
+        "base_url": "https://www.eia.gov/consumption/residential/data",
+        "php_extension": "index.php?view=characteristics",


The RECS webpage is a little bit tricky. It has tabs, each containing a different set of data files (and PDFs, etc.), that can't be reached directly from the base_url. Instead we need to append a tab-specific string (the php_extension) to the end of the URL.

Might be worth defining the shape of this dictionary as a dataclass, but definitely not a blocking concern.
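A minimal sketch of the dataclass idea, assuming the four keys visible in the diff hunks (base_url, php_extension, prefix, pattern). The class name LinkPattern, the url_for_year helper, the position of the year in the URL, and the prefix and pattern values for this tab are all assumptions made for illustration, not code from the PR.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkPattern:
    """One tab of the RECS data page (hypothetical name, not from the PR)."""

    base_url: str
    php_extension: str
    prefix: str
    pattern: re.Pattern

    def url_for_year(self, year: int) -> str:
        # Assumption: the year sits between the base URL and the tab suffix.
        return f"{self.base_url}/{year}/{self.php_extension}"


housing_characteristics = LinkPattern(
    base_url="https://www.eia.gov/consumption/residential/data",
    php_extension="index.php?view=characteristics",
    prefix="hc",  # placeholder value for illustration
    pattern=re.compile(r"hc(\d)\.(\d{1,2})\.xlsx"),  # placeholder pattern
)

print(housing_characteristics.url_for_year(2020))
```

Compared with a plain dictionary, a misspelled or missing key fails when the object is constructed rather than later, in the middle of a download.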

nilaykumar · 2025-01-22T23:51:19Z

src/pudl_archiver/archivers/eia/eiarecs.py

+        "prefix": "state-ce",
+        "pattern": re.compile(r"ce(\d)\.(\d{1,2})\.(.*)\.xlsx"),
+    },
+    # microdata


I'm skipping the microdata for now. Happy to come back and rewrite everything to be more general once I hear from folks who have a better sense of how deep we want to go down this rabbit hole.

Sounds good - we can revisit later. I'm unaware of the context of this rabbit hole - assuming you all had a discussion during the hackathon? @cmgosnell

nilaykumar · 2025-01-22T23:51:55Z

src/pudl_archiver/archivers/eia/eiarecs.py

+        zip_path = self.download_directory / f"eia-recs-{year}.zip"
+        data_paths_in_archive = set()
+        # Loop through different categories of data (all .xlsx)
+        for pattern_dict in LINK_PATTERNS:


We call get_hyperlinks multiple times: once for each tab of data.

What a fun webpage!

jdangerx

This all looks pretty good - provided some feedback on the output file name munging but nothing's blocking.

Your TODO list looks pretty good - though I'll make a draft in Zenodo just so we can have something at all archived ASAP.

jdangerx · 2025-01-27T23:02:29Z

src/pudl_archiver/archivers/eia/eiarecs.py

+        zip_path = self.download_directory / f"eia-recs-{year}.zip"
+        data_paths_in_archive = set()
+        # Loop through different categories of data (all .xlsx)
+        for pattern_dict in LINK_PATTERNS:


What a fun webpage!

jdangerx · 2025-01-27T23:03:23Z

src/pudl_archiver/archivers/eia/eiarecs.py

+        "prefix": "state-ce",
+        "pattern": re.compile(r"ce(\d)\.(\d{1,2})\.(.*)\.xlsx"),
+    },
+    # microdata


Sounds good - we can revisit later. I'm unaware of the context of this rabbit hole - assuming you all had a discussion during the hackathon? @cmgosnell

jdangerx · 2025-01-27T23:03:58Z

src/pudl_archiver/archivers/eia/eiarecs.py

+    # housing characteristics
+    {
+        "base_url": "https://www.eia.gov/consumption/residential/data",
+        "php_extension": "index.php?view=characteristics",


Might be worth defining the shape of this dictionary as a dataclass, but definitely not a blocking concern.

jdangerx · 2025-01-29T23:45:48Z

Discussed on Slack with Nilay and I'm taking this over the finish line.

Draft record (currently including 2020, 2015, 2009) on Zenodo

cmgosnell

The draft archive looks pretty good! I have a few small specific suggestions, but in general, wherever it's possible to make the link patterns more generic, I'd suggest doing that so there aren't so many to add for each year, view, and sub-view file group.

It's also kind of a bummer that get_resources can't check whether there is a new year of data. I had to poke around a lot to find somewhere to do that for CBECS. I wonder if you could just use "https://www.eia.gov/consumption/residential/data/previous.php" and search for something like ^/consumption/residential/data/(\d{4})/$
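A rough sketch of that year-discovery idea, assuming the hrefs have already been scraped from previous.php. The function name years_from_hrefs and the sample hrefs are hypothetical, and whether the real page's links actually take this form has not been checked here.

```python
import re

# Pattern taken from the review comment above.
YEAR_LINK = re.compile(r"^/consumption/residential/data/(\d{4})/$")


def years_from_hrefs(hrefs):
    """Return the distinct survey years found among href values, newest first."""
    years = set()
    for href in hrefs:
        match = YEAR_LINK.match(href)
        if match:
            years.add(int(match.group(1)))
    return sorted(years, reverse=True)


# Made-up hrefs for illustration only.
sample_hrefs = [
    "/consumption/residential/data/2020/",
    "/consumption/residential/data/2015/",
    "/consumption/residential/about.php",
    "/consumption/residential/data/2015/",
]
print(years_from_hrefs(sample_hrefs))  # [2020, 2015]
```

With something like this, get_resources could loop over the discovered years instead of a hard-coded list, so a new survey year would be noticed without a code change.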

cmgosnell · 2025-01-30T15:16:32Z

src/pudl_archiver/archivers/eia/eiarecs.py

+    2020: {
+        "housing_characteristics": LinkSet(
+            url=_url_for(year=2020, view="characteristics"),
+            short_name="hc",


I'd suggest using the view name (or some other something longer) instead of the short name in the file names so the files are more decipherable.

Yeah, if it were me I'd condense these two LinkSet attributes (url and short_name) into just view, use the view name in the filename, and construct the page URL with _url_for in get_year_resources.
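One way the condensed LinkSet might look, as a sketch only. The PR shows just the call _url_for(year=..., view=...), so the function body here is an assumption, and the filename_for helper, its naming scheme, and the sample file name are hypothetical.

```python
import re
from dataclasses import dataclass

BASE_URL = "https://www.eia.gov/consumption/residential/data"


def _url_for(year: int, view: str) -> str:
    # Assumed body; only the call signature appears in the PR.
    return f"{BASE_URL}/{year}/index.php?view={view}"


@dataclass(frozen=True)
class LinkSet:
    view: str  # replaces both url and short_name
    pattern: re.Pattern
    extension: str


def filename_for(year: int, link_set: LinkSet, matched: str) -> str:
    # Hypothetical naming scheme that spells out the view name.
    return f"eia-recs-{year}-{link_set.view}-{matched}.{link_set.extension}"


microdata = LinkSet(
    view="microdata",
    pattern=re.compile(r"(recs.*public.*)\.csv"),
    extension="csv",
)

# The page URL is derived where it is needed rather than stored per LinkSet.
page_url = _url_for(year=2020, view=microdata.view)
match = microdata.pattern.search("recs2020_public_v7.csv")  # illustrative name
print(page_url)
print(filename_for(2020, microdata, match.group(1)))
```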

cmgosnell · 2025-01-30T15:23:37Z

src/pudl_archiver/archivers/eia/eiarecs.py

+        "microdata": LinkSet(
+            url=_url_for(year=2020, view="microdata"),
+            short_name="microdata",
+            pattern=re.compile(r"(recs.*public.*)\.csv"),
+            extension="csv",
+        ),
+        "microdata-codebook": LinkSet(
+            url=_url_for(year=2020, view="microdata"),
+            short_name="microdata",
+            pattern=re.compile(r"(RECS 2020 Codebook.*v.)\.xlsx"),
+            extension="xlsx",
+        ),


Not blocking, but I would suggest making these a little more generic so there aren't so many to maintain.

They both have the year, and they both have "recs" if you add re.IGNORECASE, and you could use (\.csv|\.xlsx) for the different file types. This is kind of a high-level overall suggestion to condense patterns when possible, but not a hill I will die on.
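A sketch of what one condensed pattern could look like, following the hints in the comment (case-insensitive "recs", the year, and either extension). The helper name microdata_pattern and the sample file names are illustrative assumptions; the exact pattern the archiver should use depends on the link text actually on the page.

```python
import re


def microdata_pattern(year: int) -> re.Pattern:
    """Match RECS microdata and codebook files for one survey year."""
    return re.compile(rf"(recs.*{year}.*)\.(csv|xlsx)", re.IGNORECASE)


pattern = microdata_pattern(2020)

# Illustrative names only; not checked against the EIA page.
for name in [
    "recs2020_public_v7.csv",
    "RECS 2020 Codebook for Public File - v7.xlsx",
    "readme.pdf",
]:
    match = pattern.search(name)
    print(name, "->", match.groups() if match else None)
```

The second capture group carries the extension, so a separate extension attribute per link set would no longer be needed in this variant.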

cmgosnell · 2025-01-30T15:29:30Z

src/pudl_archiver/archivers/eia/eiarecs.py

+
+    async def get_resources(self) -> ArchiveAwaitable:
+        """Download EIA-RECS resources."""
+        for year in [2020, 2015, 2009]:


Suggested change

-        for year in [2020, 2015, 2009]:
+        for year in YEAR_LINK_SETS:

Or see my suggestion in the main comment.

cmgosnell · 2025-01-30T15:39:58Z

src/pudl_archiver/archivers/eia/eiarecs.py

+                matched_filename = (
+                    match.group(1)
+                    .replace(".", "-")
+                    .replace(" ", "_")


we've typically attempted to always use dashes instead of _ for file names

Suggested change

-                    .replace(" ", "_")
+                    .replace(" ", "-")
+                    .replace("_", "-")
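The effect of the suggested change can be sketched as a small standalone helper. The function name munge_filename and the sample input are hypothetical; only the three replace calls come from the diff and the suggestion.

```python
def munge_filename(raw: str) -> str:
    """Replace dots, spaces, and underscores with dashes."""
    return raw.replace(".", "-").replace(" ", "-").replace("_", "-")


# Made-up input for illustration.
print(munge_filename("ce1.1.st by_fuel"))  # ce1-1-st-by-fuel
```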

nilaykumar added 4 commits January 22, 2025 14:23

Initial work on eiarecs

693714f

Forgot to save the rebase edit (awkward)

de35d0e

Added consumption and state data

274b7dd

Adding files to zip

abd3566

nilaykumar commented Jan 22, 2025

Removing accidental zipping in loop

67cbc8c

nilaykumar commented Jan 22, 2025

e-belfer requested a review from jdangerx January 27, 2025 19:06

e-belfer assigned nilaykumar Jan 27, 2025

e-belfer added new-data community labels Jan 27, 2025

jdangerx reviewed Jan 27, 2025

e-belfer mentioned this pull request Jan 29, 2025

Add new archiver for EPA eGRID #549

Merged

jdangerx added 3 commits January 29, 2025 16:12

chore: replace output filename munging code

b08b0e1

feat: add 2020 microdata + methodology

1b87417

feat: add 2015

f06f7e3

2015 methodology required some changes to allow for downloading html files.

jdangerx requested a review from cmgosnell January 29, 2025 23:44

feat: add 2009 and historical 457 forms

ed70331

cmgosnell requested changes Jan 30, 2025
